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ABSTRACT 

As biotechnology advances rapidly, a tremendous 
amount of cancer genetic data has become avail- 
able, providing an unprecedented opportunity for 
understanding the genetic mechanisms of cancer. 
To understand the effects of duplications and dele- 
tions on cancer progression, two genomes (normal 
and tumor) were sequenced from each of five 
stomach cancer patients in different stages (I, II, III 
and IV). We developed a phylogenetic model for 
analyzing stomach cancer data. The model 
assumes that duplication and deletion occur in ac- 
cordance with a continuous time Markov Chain 
along the branches of a phylogenetic tree attached 
with five extended branches leading to the tumor 
genomes. Moreover, coalescence times of the 
phylogenetic tree follow a coalescence process. 
The simulation study suggests that the maximum 
likelihood approach can accurately estimate param- 
eters in the phylogenetic model. The phylogenetic 
model was applied to the stomach cancer data. 
We found that the expected number of changes 
(duplication and deletion) per gene for the tumor 
genomes is significantly higher than that for 
the normal genomes. The goodness-of-fit test 
suggests that the phylogenetic model with 
constant duplication and deletion rates can 
adequately fit the duplication data for the normal 
genomes. The analysis found nine duplicated 
genes that are significantly associated with 
stomach cancer. 



INTRODUCTION 

Cancer is one of the leading causes of death in Americans 
(1). Cancer research has led to a variety of effective 
treatments and diagnostic techniques for cancers. Yet, 
the fundamental genetic mechanisms that turn normal 
cells into tumors remain mysterious. Advances in the bio- 
technology field have provided an unprecedented oppor- 
tunity for understanding the origin and progression of 
cancer (2-4). The availability of genetic data ignites the 
hope that we may discover the genetic mechanisms of 
cancer by examining the genetic differences between 
normal and cancer genomes (5). It is, however, a 
challenging task to effectively analyze such genetic data 
by modeling the genetic variation observed within and 
between the normal and cancer groups (6). Previous 
studies have demonstrated that cancer progression is an 
evolutionary process in which mutation and natural selec- 
tion are two key factors (7,8). Mutation causes genetic 
variation among normal cells that can trigger cancer (9). 
On the other hand, selection plays an important role in 
therapeutic resistance (10-12) and in the birth and death 
process of cancer cells, as cancer cells vary and the fittest 
ones survive after competition (13). 

In the last few decades, theory from evolution and 
ecology has been adapted in cancer studies to investigate 
the genetic mechanisms of cancer (14-18). Muto et al. (19) 
studied colon cancers and found that most colon cancers 
have evolved from adenomatous polyps known as polyp- 
cancer sequences. Evolutionary ideas have been explored 
in many cancer analyses (20-22). Nowell (23) proposed a 
landmark colonial evolution model for tumor progression, 
which assumes that most neoplasms originate from a 
single cell. Gillies et al. (24) proposed an evolutionary 
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model for malignant cancers, in which the micro-environ- 
mental forces such as hypoxia can stimulate genetic in- 
stability and impose selection pressures on cancer cells. 
Recently, Wu (25) investigated the evolution of cancer 
cells after the primary tumor had spread to secondary 
sites (26). Ultimately, cancer evolution within an individ- 
ual can be viewed genetically as adaptation to a new life- 
style and ecologically in the context of the other cell types 
and resources available in an individual. 

Heterogeneity of cancer caused by genetic instability is 
the main challenge in the process of understanding cancer 
evolution and in the process of identifying driver genes (8). 
Due to this challenge, cancer data sets often lack signal 
regarding the evolutionary process of cancer. It is difficult 
to find genomic mutations/events that trigger cancer, espe- 
cially those that trigger the early-stage cancer. High 
throughput technologies, particularly next generation 
sequences (NGS) provide researchers with new 
opportunities to understand the evolutionary process of 
cancer development at a single cell nucleotide level 
(16,27-29). NGS technology is able to identify alterations 
in the genome, e.g. chromosomal rearrangement and copy 
number variation, rather than point mutations; and can 
sequence genetic material from lower-frequency samples 
(30). Because of the advantages of NGS data, the NGS 
technology has been extensively used in cancer studies to 
examine genetic mechanisms that cause cancer 
progression. 

Gene duplication is believed to play an important role 
in tumor progression (31). Duplicated genes have been 
frequently observed in the genomes of cancer patients. 
Waris and Ahsan (32) suggested that gene duplication 
and other changes in DNA may be involved in the initi- 
ation of various cancers. Previous studies found that there 
is a strong correlation between gene duplication and large 
tumor size, indicating that gene duplication may play a 
critical role in tumor progression (33). However, the 
information at early stages of cancer is usually unavailable 
and little is known about the relationship between gene 
duplication and early-stage cancer. 

The primary goal of the study is to investigate the effects 
of gene duplication and deletion on the incidence and pro- 
gression of cancers. Specifically, this study aims to estimate 
the duplication and deletion rates on normal and tumor 
genomes, and to identify duplicated genes that are highly 
associated with stomach cancer. We have developed a 
probabilistic model in the context of coalescent trees of 
normal genomes attached with five tumor genomes for 
understanding how gene duplication and deletion are 
related to different stages of cancer as cancer progresses. 
A maximum likelihood approach is adopted to estimate 
model parameters, including duplication and deletion 
rates. This approach can identify duplicated genes that 
are significantly associated with stomach cancer. 

MATERIALS AND METHODS 

Genome annotation and duplication data 

The genomic data was obtained from five stomach cancer 
patients (34). Two samples (tumor and normal) were 



Table 1. The number of duplicated genes on the 
genomes of five stomach cancer patients 



Subject 


Stage 


No of duplicated genes 
in tumor genomes 


SI 


1 


64 


S2 


11 


84 


S3 


11 


57 


S4 


111 


75 


S5 


IV 


72 



taken from each patient; tumor tissues were surgically 
removed from part of the patients' stomachs, while 
blood samples were extracted as normal tissues from the 
same patients (34). Determination of pathologic stages of 
tumor tissues is based on the standards recommended by 
World Health Organization (WHO). Pathological exam- 
ination suggested that two patients were in stage II of 
stomach cancer, while remaining three patients were in 
stages I, III and IV (Table 1). Stomach cancer has two 
subtypes in terms of the genome instability — micro satel- 
lite instability (35) and chromosome instability (36). 
However, the subtypes of the stomach cancer for five 
patients in this study are not available in (34). The 
genomes were sequenced for each of the two samples 
(normal and tumor). Both the normal and tumor 
genomes were compared with the human reference 
genome to identify duplicated genes. As the human refer- 
ence genome is a haploid sequence, it may result in 
underestimation of duplication events. High-confidence 
duplication events were identified if a junction in the 
genomic data satisfied all of the following criteria: (i) at 
least 10 mate-pairs in cluster for its junction, (ii) successful 
de novo assembly of the junction, (hi) high mapping diver- 
sity with both left length and right length no less than 70 
and (iv) absence of specific repeat sequences on left and 
right side of junction. As it is assumed that duplication 
events occur independently among genes, the junctions 
that covered more than two genes were excluded. With 
these criteria, we identified 210 genes on which duplication 
occurred for at least one of the 10 genomes (5 normal and 
5 tumors). We use T' to denote duplication and '0' to 
denote no duplication. Cui et al. (34) did not estimate 
the total number of genes in the genomes of five 
patients. We used the estimate from ENCODE (37) that 
the total number of genes in the human genome is 21 000. 
Because the most significant inferences are based on the 
relative duplication and deletion rates, uncertainty in the 
total number of human genes does not affect the major 
conclusions of the data analysis. In summary, the data 
matrix D has 21 000 rows and 10 columns; each row rep- 
resents a gene and each column represents a genome 
(normal or tumor). The entries in the matrix D are 
either 0 (no duplication) or 1 (duplication). 

A phylogenetic model for gene duplication 

The stochastic process of gene duplication and deletion 

The process of gene duplication and deletion is a continu- 
ous time Markov process with two states 1 (duplication) 
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and 0 (no duplication). Let d(t) denote the Markov 
process on states 0 and 1. We assume that transition 
probabilities Py{?) are stationary and the infinitesimal 
duplication and deletion rates are a and b, respectively. 
The probability of a duplication event (and a deletion 
event) during a time period of duration At, as A? — > 0, 
is (38,39) 



P(d(t+A t ) = 1|4/) = 0) = aA,+o(A,) and 
P(d(t+A,) = 0\d(t) = 1) = bA,+o(A,) 



(1) 



The notation o(At) indicates lim A ^o o(At)/At — 0. 
The probability distribution P(t) of d(t) can be derived 
from theory of Markov process. Let T = {a + b)t and 
m = a/(a + b). The transition probabilities P,y(/) are given 
as follows: 

P^,d 2 (T) = \d^+d 2 - 1|+(1 - 2d l )(2d 2 - lW~ x < 

x (1 - m) Xl (l - e- T ), for d u d 2 = {0,1}. 



(2) 



As oo, 
Po,o(T) -> 1 - 



m, Pn, i(T) — ► m, Pi^(T) — »■ 1 — m, and 



(3) 



Thus the limiting distribution as T ^ oo is P(d(T) = 0) = 
1 — m and P{d(T) = 1) = m. This model is time reversible, 
in the sense that P(d(t) = 0, d(t +T) = 1) = = 1, 

d(t+T) = 0). 

77i£ likelihood function under the phylogenetic model 

A phylogenetic tree with five extended branches describes 
the history of 10 genomes of five cancer patients. The 
five normal genomes are at the tips of the tree, which 
are attached with five extended branches leading to 
the tumor genomes (Figure 1). The tree without the 
extended branches represents the history of five normal 
genomes, while the extended branches represent the 
duplication/deletion process that leads to the tumor 
genomes. Each tumor progression was treated as an 
independent process, even though the tumors progressed 
through the same stages phenotypically. As described 
above, the normal and tumor sequences are coded as 
binary data (0: no duplication or 1: duplication). Each 
site of the sequences contains duplication status of a gene 
across five patients. Parameters of the phylogenetic 
model include the topology of the tree, branch lengths 
T h and parameter m = a/(a + b). Let M be the number of 
branches in the tree and W be the number of 
extended branches. We assume that m is constant on 
the main branches of the tree, but the extended 
branches have variable (relative) duplication rates {m e k , 
k=\,..., W). Given a phylogenetic tree 5 (topology, 
branch lengths T and parameter m) and the extended 
branches (branch lengths T' and m e ), the probability 
distribution of data matrix D can be derived from 
the transition probability function in (2). Let dy and 
d*ij be the duplication status of gene j at the two ends 
of branch i (with length Tj) in tree 5" with topology x. 




Figure 1. The tree in the phylogenetic model. The normal genomes (N) 
at the tips of the tree are attached with five extended branches leading 
to the tumor genomes (T). N; and T; are the normal and tumor 
genomes of patient j. The tree (above normal genomes) represents the 
history of five normal genomes, while the extended branches represent 
the process leading to the five tumor genomes. The normal genomes are 
the ancestral genomes of the tumor genomes. 



It follows from (2) that the probability of dy and d~, 
given branch length T h parameter m, and tree topology 
t, is 



P{dij,d~\Ti,m, r) 



dy + d^ 



+(1 - 2<%)(24 - lV"*(l 



(4) 



Given the states of the internal nodes, the Markov 
processes on different branches of the phylogenetic tree 
are independent of one another. The probability distribu- 
tion function of duplication events on gene j (denoted 
by Dj) is the product of the probabilities for individual 
branches in (4), i.e., 

f M 



;=1 



P(D h I\ T, m, r, V, m e ) = j \\ P{d ih d* f | T u m, r) 

' w 

YlP(c k j,c* kj \r k ,ml) 



ik=\ 



(5) 



In (5), / denotes the duplication status at the internodes 
of the tree. The first term in (5) is the probability of 
duplication events in normal genomes, given the phylo- 
genetic tree without the extended branches. The second 
term is the probability of duplication events in tumor 
genomes, given the extended branches, in which c kj 
and cjL are the duplication status of gene j at the 
two ends of the extended branch k. The probability 
function P(Dj, I \ T, m, t, V , m e k ) in (5) assumes that 
the duplication status at the internodes of the tree are 
given. Because in reality / is often not given, we calculate 
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P{Dj | T, m, t, T', m e ), which is the sum over all possible 
realizations of /, i.e., 

P(Dj\ T, m, r, r, m e ) = P{Dj, I\ T, m, r, T, m e ). (6) 



duplication states at the internal nodes of the 
phylogenetic tree, 



p(D j \m,6) = Y,P{D j ,I\m,( 



The probability distribution P(D ; - | T, m, x, T\ m e ) in (6) 
can be efficiently calculated by a peeling technique 
described by Felsenstein (40). 

In the context of population genetics, the phylogenetic 
tree of n individuals varies over different loci due to the 
coalescent. Let t = {tj, j = 2, . . . , n) be the waiting times 
until the next coalescence event. Let 6 = AuN e be the 
population size parameter, in which N e is the effective 
population size and is the change (duplication and 
deletion) rate per gene. According to the coalescent 
theory, the waiting times t/s are independently distributed 
with the exponential density 



MO) = 



(7) 



The expected coalescent time for a haploid genome from 
two individuals chosen at random from the human popu- 
lation is E(t 2 ) = 6/2, which indicates that if the sequences 
of a gene are sampled from one of the genomes of two 
individuals chosen at random from the human population, 
the expected duplication probability E{P 0 ,;(T)) is equal to 
the probability P 0 ,/(T), averaging over coalescence time T, 
which has an exponential density described in (7), i.e., 



00 

E(P 0tl (Tj) = J m{\ 



e- 1T )-e- 1T ' 9 dt = 



m6 
6+1 



(8) 



Similarly, the expected deletion probability is E(Pi ! o(T)) = 
The expected number of changes per gene 



between two genomes chosen at random from the 
human population is 

E(P( Xl =0,x 2 = l\T)) = 

OO 

/ (1 _ m)m{l _ e-^) 2 e^d T J X - m)m6 , (9) 
J 6 0+1 ' 

o 

in which X\ represents the duplication status of a gene in 
one of the two genomes, and x 2 represents the duplication 
status of the same gene in the other genome. 

The parameters 6 and m are estimated by averaging 
over gene trees, in which branch lengths T are the sum 
of a set of coalescence waiting times t with a density 
function described in (7). The probability of observing 
certain duplication states (D for current individuals 
and / for their ancestors) of a gene is then equal to the 
likelihood in (5) (without the extended branches) 
averaging over coalescence waiting times t i.e., 



P(D h I\m,6) = j YlP^dtmrn 



x)f{t\6)dt. 



The probability P(Dj \ m, 6) for a single locus is equal 
to the probability in (5) summing over all possible 



Since probability P{Dj \ m, 6) under the coalescent 
model is invariant to the order of the duplication 
states of individuals, the relevant random variable here 
is the number of duplications across individuals. 
When there are n individuals, the number of individuals 
who have duplication for a particular gene could be 0 
up to n. Let {x h i = 0, . . . , «} be the number of genes 
for which i individuals have duplications. The sum of 
Xj's (N) is the total number (21 000) of genes considered 
in this study. We use {pj, i = 0, . . . , n} to denote the 
probability of observing i individuals with duplication. 
Thus {xj, i = 0, . . . , n] follows a multinomial distribu- 
tion, i.e., 



P(x | m, ( 



x Q \...x n \*_ 0 



Because the multinomial coefficient does not involve 
model parameters, we delete this term and write the log- 
likelihood function as 



l(m,6) = Y^Xilogpi 



(10) 



<-o 



In this equation, p t is a function of m and 6, which will 
be derived as follows under the coalescent theory. 
Considering a simple case of two individuals, the coales- 
cence time / has an exponential distribution with density 
2e~ 2r/e /6. Let y be the duplication state at the root, and z x 
and z 2 be the duplicate states of two individuals at the 
tips of the tree. As there are only two states for y, z x 
and z 2 , the domain of y, zj and z 2 has only two values, 
0 and 1. The goal is to derive the probabilities of (zj = 0, 
z 2 = 0), (z, =0, z 2 = 1), (z, = 1, z 2 = 0) and (z, = 1, 
z 2 = 1). We assume that the states at the root have the 
equilibrium distribution with probability mass function 
P(y = 0) = \ —m and P(y = 1) = m. The probability of 
(Z[ = 0, z 2 = 0) is 

P(z, = 0, z 2 = 0) = P(z, = 0, z 2 = 0\y = 0) 

x P(y = 0)+P(zi = 0,z 2 = 0|j; = 1) x P(y = 1) 

OO 

= (1 - m) J 2(1 - m+me-'fe- 1 ' 16 ' /6dt 



1=0 



oo 

*»/*>-"iKi-OV^/« 

= (1 -mf+m{ \ -m)/(6+l). 

Similarly, P{z { =0, z 2 = 1) = P(z, = 1, z 2 = 0) = 
6m(\-m)/(6+ 1), and P(z, = 1, z 2 = 1) = m 2 + m(\-m)/ 
(6 + 1). Thus we have p 0 = (1 - mf + m(l - m)/(6 + 1), 



-'\2 a -2t/e , 
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pi = 29m(l-m)/(0+ 1), and p 2 = m 2 + m(l - m)/(0+ 1). 
The log-likelihood function becomes 

l(m,6) = x 0 log{(l - m) 2 +m(l - m)/(6+l)} 

+ x x log{2Qm{\ - m)/(G+\)} (11) 
+ x 2 log{m 2 +m{\ - m)/{6+\)}. 

For an arbitrary number of individuals, we use an iterative 
algorithm (Supplementary Material SI) to calculate prob- 
ability pi. The maximum likelihood estimates of 6 and 
m are obtained by using the L-BFGS-B algorithm (41) 
implemented in an R optimization function optim. In 
addition to the estimates of model parameters, function 
optim outputs the hessian matrix (also called observed 
Fisher information matrix) that can be used to calculate 
the variances of the estimates. 

Additionally, equation (5) implies that duplication 
processes occurring on the extended branches are condi- 
tionally independent of those occurring on the other 
branches of the tree. Thus, parameters on the extended 
branches can be estimated separately. Let {a k ; k = 1, . . . , 
W) and {b k ; k= 1, . . . , W] be the duplication and deletion 
rates on the W extended branches. The ratio parameter is 
m k = a k /(a k + b k ) and the branch length is 7| = (a k + b k )t k . 
Parameters {m k , J\\ k=l,..., W} on the extended 
branches can be estimated from the empirical frequencies 
of observations 00, 01, 10 and 1 1 on the normal and tumor 
genomes of each patient. The two digits are the duplica- 
tion status of the genes on the normal and tumor genomes, 
respectively, from the same patient. Let n 0 o, n 0 i, n 10 , n n 
be the count of the genes with pattern 00, 01, 10 and 11, 
respectively. The count n 0 i has binomial distribution with 
p 0l = nf(l — e~ r ) and n Q = n 00 + noi, in which V is the 
length of the extended branch. Similarly, the count ni 0 
has binomial distribution with probability p w = (1 — rrf) 
(l—e~ r ) and n x = n 10 + n n . The maximum likelihood 
estimate of p m is n 0 i/n 0 , i.e., pa = m c {\ — e~ r ) « «oi/«o 
and pio = (1 — m e )(\ — e~ r ) « «io/«i- Thus, the esti- 
mates of rrf and J 6 are given by rrf = — — and 

T = -log\\ -^4. 
Simulation study 

To evaluate the performance of the phylogenetic model 
developed in the previous section, duplication and 
deletion events were simulated from the phylogenetic 
model. The values of parameters (m, 6) were set to (0.01, 
0.01), (0.01, 0.1), (0.3, 0.01), (0.3, 0.1), respectively. For 
each parameter setting, we simulated duplication and 
deletion events for 1000, 5000 and 10000 genes. The 
simulated data were then used to estimate parameters 
(m, 0) in the phylogenetic model. Each simulation was 
repeated 10 times, and the square root of mean square 
error (RMSE) between the estimate and the true value 
of the model parameter was calculated. Let 6 be the 

estimate of parameter 6. The RMSE is JjEL ~ ^ 

where g is the number of simulations and 0, is the estimate 
of 6 for the i-th simulation. Overall, the results show that 
the RMSEs of parameters m and 6 decrease as the number 



(a) Estimating m 
0.018 n 




1000 5000 10000 

number of genes 



(b) Estimating 0 
0.03 -. 

C 




1000 5000 10000 

number of genes 

Figure 2. Simulation results. The square root of the mean square error 
for (a) estimating parameter m, and for (b) estimating parameter 8. 



of genes increases for all parameter settings (Figure 2). 
The RMSE of m depends on not only the value of m, 
but also the value of 6 and the number of genes. It 
appears that m has a smaller RMSE when 6 is large 
(6 = 0.1) at 1000 genes (Figure 2a). But this pattern is 
reversed for m = 0.3 at 5000 and 10000 genes. In 
contrast, 6 consistently has a smaller RMSE when m is 
large (Figure 2b). This may be caused by the fact that a 
large m tends to generate more duplications in the 
simulated data. Thus it is relatively straightforward to 
estimate 6 when m is large. For all parameter settings, 
the RMSEs of m and 9 decrease to a reasonable level 
(<0.008), when the number of genes reaches 10000. 



RESULTS 

In the stomach cancer data set, there are 210 genes on which 
duplication has occurred for at least one of the 10 genomes. 
The remaining 20 790 genes have the pattern of (0,0,0,0, 
0,0,0,0,0,0), given that the total number of genes on the 
human genomes is 21 000 (37). The number of duplicated 
genes in the tumor genomes varies across the stages of 
stomach cancer (Table 1). There is a high degree of individ- 
ual variation in this data, although this might suggest that 
duplication and deletion rates may vary across different 
stages of stomach cancer. 
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We use the phylogenetic model to estimate and compare 
the overall changes (duplications and deletions) on the 
normal genomes and tumor genomes. We expect that 
the overall changes in the tumor genome are significantly 
higher than those in the normal genomes. We also inves- 
tigate the pattern (increasing or decreasing) of the dupli- 
cation and deletion rates across cancer stages. Finally, 
we identify significant duplication and deletion events 
associated with tumor genomes. As duplication and 
deletion rates depend on the total number of genes in 
the human genomes, the duplication and deletion rates 
estimated from the stomach cancer data set are relative 
rates. Moreover, the strategies of identifying duplication 
events on the genomes of five patients may underestimate 
the number of duplications. Because underestimation 
occurs for both normal and tumor genomes, it will not 
affect the gross conclusions based on the comparison of 
the relative duplication and deletion rates on normal and 
tumor genomes, or among the tumor genomes in different 
cancer stages. 

We first assumed a fix tree for all genes (described in 
Figure 1), and used a Bayesian approach (Supplementary 
Material S2) to estimate the phylogenetic tree and the du- 
plication and deletion rates. The Bayesian estimate of the 
phylogenetic tree is poorly supported with all posterior 
probabilities <0.4 (Supplementary Figure S3c). The low 
posterior probabilities for the nodes in the Bayesian tree, 
despite such a large number of observations, suggest that 
there is not a single tree generating the empirical data. 
Thus, we modeled the gene trees in the context of popu- 
lation genetics using the coalescent theory as described 
above, and calculated the maximum likelihood estimates 
of model parameters m and 6. 

The maximum likelihood estimates of model parameters 

The model parameters m and 6 were estimated by 
maximizing the log-likelihood function described in 
equation (10). We used the L-BFGS-B algorithm (41) 
implemented in the R optimization function optim to 
maximize the log-likelihood function in equation (10). 
The estimate of 6 for tumor genomes is twice as high as 
that for normal genomes (Table 2). The expected number 
of changes per gene for the tumor genomes is 
m(l - m)0/(l+0) = 0.0022 (see equation (9)), which is 
significantly higher than that (0.0012) for the normal 
genomes. This result suggests that the number of 
changes (duplications and deletions) in the tumor 
genomes is significantly greater than the number of 
changes in the normal genomes. 

The goodness of fit of the phylogenetic model was 
evaluated by the chi-square goodness-of-fit test 



Table 2. The maximum 


likelihood estimates of m 


and 6 for normal 


and tumor genomes 








m (SE) 


d (SE) 


Normal 


0.0028 (0.0002) 


0.7134 (0.1024) 


Tumor 


0.0037 (0.0003) 


1.4750 (0.1881) 



The values within parentheses are standard errors of the estimates. 



implemented in an R function chisq.test. The observed 
counts of genes for which 0 up to 5 individuals have 
duplication were calculated for the normal and tumor 
genomes (Table 3). Moreover, the expected count of 
genes for which i individuals have duplication equals 
21 000 x pi, in which the probability p-, of observing i 
individuals with duplication was obtained from the 
L-BFGS-B algorithm described in the previous section. 
The probabilities {p 0 , p\, p 2 , P3, P4, ps} for the normal 
genomes are 0.9942, 0.0021, 0.001 1, 0.0008, 0.0007, 
0.0008, respectively. The chi-square test cannot reject the 
phylogenetic model for the normal genomes, with 
P-value = 0.828 (Table 3). In contrast, the probabilities 
{po-, Pi, P2, P^ P4, Pi} for the tumor genomes are 0.9903, 
0.0049 0.0022, 0.0012, 0.0007, 0.0004, respectively. The 
phylogenetic model for tumor genomes is strongly 
rejected by the chi-square test, with P-value < 10~ 6 
(Table 3). The phylogenetic model assumes constant 
duplication and deletion rates across branches of the 
tree. However, duplication and deletion rates may be 
highly variable in different stages of stomach cancer, 
and thus the assumption of constant duplication and 
deletion rates may be seriously violated when modeling 
tumor genomes in different stages of cancer. To take 
into account variable duplication and deletion rates, we 
separately fit the two-states duplication and deletion 
model to each of the external branches. As we expected, 
duplication and deletion rates vary across external 
branches (Table 4). Overall, the deletion rates are much 
higher than the duplication rates on the extended branches 
(Table 4), suggesting that deletion occurred more often 
than duplication in tumor genomes. The duplication and 



Table 3. The chi-square goodness-of-fit test for the phylogenetic 
model 



No of Normal Tumor 
patients with 



duplication Observed 


Expected 


Observed 


Expected 


counts 


counts 


counts 


counts 


0 20 885 


20 880 


20803 


20 798 


1 46 


44.6 


129 


103.2 


2 23 


24.1 


26 


46.4 


3 18 


17.8 


16 


26.9 


4 10 


15.6 


7 


16.4 


5 18 


17.6 


19 


8.9 


/'-value = 


0.8283 


/"-value = 


7.3e-07 


Table 4. The estimates of relative duplication and deletion rates on 


the extended branches 









Duplication rate Deletion rate 



Tl 


0.0013 


0.2626 


T2 


0.0024 


0.2079 


T3 


0.0007 


0.2412 


T4 


0.0009 


0.1443 


T5 


0.0017 


0.2817 
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deletion rates appear not to have either an increasing or 
decreasing pattern associated with cancer stages. 

Identifying cancer-related duplicated genes 

Let x denote the duplication status of a gene, with x = 1 
referring to the cases where the gene collected from the 
tumor tissue is duplicated, while the gene collected from 
the normal tissue of the same patient is not duplicated. If 
duplication of the same gene is observed on a large 
number of tumor genomes, it is strong evidence that the 
duplicated gene is associated with tumor. We call this type 
of duplication 'cancer-related duplication' (occurring on 
the tumor genome, but not on the normal genome). Let 
y be the number of cancer-related duplications for a par- 
ticular gene observed in the genomes of five patients. For 
the stomach cancer data, the value of y can be 0 up to 5. 
Under the null hypothesis that the duplication of a par- 
ticular gene in the tumor genome is normal, we expect that 
the observed cancer-related duplication probability of a 
gene will be similar to the duplication probability in 
normal genomes. Thus, a duplicated gene is associated 
with cancer if the observed probability is significantly 
higher than the duplication probability in normal 
genomes. Given that m = 0.0028 and Q = 0.7134, the 
average duplication probability in normal genomes is 
p = m6l(\ +6) = 0.0012 (see equation (8)). Under the 
null hypothesis, the random variable y (number of dupli- 
cations) has a binomial distribution with P = 0.0012 and 
n = 5. The null hypothesis was rejected for nine duplicated 
genes (CDH4, CLPS, CLSTN2, EML5, NPEPL1, 
SENP5, SPTB, VAMP7, XAGE-4), with the overall 
P-value < 0.05 adjusted by Bonferroni correction for 
multiple comparisons (Table 5). The same list of genes 
was identified when the duplication probability p was 
calculated from the two ends of the 95% confidence 
interval (mean ± 2SE) of m and 6. Similarly, the 
frequencies of deleted genes on the tumor genomes are 
compared with the deletion probability in normal 
genomes. If the observed frequency of deletions is signifi- 
cantly higher than the deletion probability in the normal 
genomes, we conclude that the deletion is significantly 
associated with cancer. We did not find any deletion 
that is significantly associate with cancer. 

The functional annotation of nine significantly 
duplicated genes (Supplementary Table SI) was conducted 
by the DAVID web server (42). The analyses generated 



Table 5. Identification of duplicated genes associated with cancer 



Number of 


Number 


P-value 


Cumulative 


Significance 


duplications 


of genes 




P-value 




0 


20 853 


1.0 


1.0 




1 


109 


0.006 


>0.5 




2 


7 


1.4e-05 


le-04 


* 


3 


1 


1.7e-8 


1.7e-08 


* 


4 


0 


le-11 


2.4e-15 


* 


5 


1 


2.4e-15 


2.4©- 15 


* 



The significant genes are indicated by *. The cumulative P-value was 
adjusted with Bonferroni correction for multiple comparisons. 



two significant annotation clusters (Supplementary 
Table S2). The first annotation cluster includes three 
genes (CDH4, CLSTN2 and NPEPL1), which are 
related to ion binding, specifically metal ion binding. 
Metal ion binding has been found to play an important 
role in the anticancer activity of UK-1 analogs (43). The 
four genes (CDH4, VAMP7, CLSTN2 and SPTB) in the 
second annotation cluster are mainly related to membrane 
or transmembrane proteins, which function as gateways 
to link inside and outside of a cell. Previous cancer 
studies suggest that membrane proteins are related to 
cancer progression (44), and transmembrane genes are 
usually quite important in drug design (45). 

DISCUSSION 

Genomic data have become one of the most valuable re- 
sources of information for understanding the genetic 
mechanisms of cancer (22). Due to the complexity of the 
genomic data, it is challenging to develop a probabilistic 
model that can effectively extract useful information 
from genomic data. The genome-wide association study 
(GWAS) is a powerful approach for identifying cancer- 
related genes, based on comparison of single-nucleotide 
polymorphisms (SNP) in the normal and cancer 
genomes (46^18). The phylogenetic model developed in 
this article is based on the same principle to identify 
cancer-related duplications by comparing the normal 
and tumor genomes. Additionally, the phylogenetic 
model adds a layer of biological realism to the analysis 
that was otherwise not present in the GWAS analysis. 

The phylogenetic model developed in this article is 
designed for genome-wide duplication data analysis. It 
has been shown through simulation that the phylogenetic 
method can accurately estimate the model parameters, 
including duplication and deletion rates. Previous studies 
suggest that the mechanism of cancer is complex and 
may involve multiple biological processes (49). For those 
cases, the analysis based on the phylogenetic model in 
which only duplication and deletion events are considered 
may produce biased results. In the future, we will extend 
the current phylogenetic model by including more biolo- 
gical factors (see for example, 50 in an evolutionary 
context). In the phylogenetic model, we assume that 
genes are independent of each other. This assumption 
may not hold, because several genes might be in the 
same linkage block or under selection for functional 
purposes. Treating genes as independent samples, while 
they are not, may increase the effective sample size and 
thus produce an estimator with an artificially smaller 
variance. In addition, non-independent gene trees may 
bias the estimates of model parameters, especially when 
the recombination events are highly correlated with dupli- 
cation and deletion events. The effect of non-independent 
gene trees depends on the recombination rate of human 
genomes. Non-independent gene trees have been modeled 
for a three-taxon case (51), but it is generally quite difficult 
to deal with non-independent gene trees due to link- 
age disequilibrium. Although we do not deal with non- 
independent gene trees in this article, this issue clearly 
needs more attention. 
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Despite the fact that genomic data from cancer patients 
will become increasingly available, the high cost of 
sequencing whole genomes significantly limits the size of 
such genomic data. The data set analyzed in this article 
contains genomes from only five patients, one or two 
patients for each stage of stomach cancer. We expect 
that the availability of multiple genomes from more 
patients (along with the actual number of gene copies 
for each gene) will significantly improve the estimation 
of model parameters and increase the power for testing 
relevant biological hypotheses about the mechanisms of 
cancer under the phylogenetic model. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online. 
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